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Abstract 

MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a 
reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing 
technologies, including lllumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific Biosciences SMRT. Indeed, 
MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low- 
coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash 
clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well 
as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK 
provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating 
outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement 
confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality 
scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. 
In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping 
quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our 
command and pipeline launcher system GKNO (http://gkno.me). 
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Introduction 

The widespread availability of next-generation sequencing 
platforms has revolutionized the life sciences through the ever 
more accessible ultra-high throughput DNA sequencing efforts 
[1]. Next-generation sequencing technologies including lllumina, 
Complete Genomics, and Applied Biosystems (AB) SOLiD have 
been driving the current market forward, whereas Pacific 
Biosciences SMRT [2], Ion Torrent [3], and Nanopores [4] are 
leading the development of third-generation sequencing instru- 
ments. These technologies bring novel opportunities for many 
applications including genetic variant discovery, epigenomic 
variant discovery, RNA-Seq and ChlP-Seq, but also provide 
complex computational challenges. The short reads generated by 
these technologies are generally aligned to a reference genome as 
an early step in many of the current analysis workflows and the 
alignment quality limits the accuracy of any downstream analysis. 
Large sequencing projects often use sequencing machines from 
multiple manufacturers for data generation and can also make use 
of legacy data. It is desirable that any researcher tasked with 
analyzing the available data need not learn the intricacies of 
multiple alignment software packages to utilize all of the available 
data. This is unnecessary, since, MOSAIK can, uniquely, 
accurately align sequencing data from all current and legacy 
platforms. 



Current sequencing technologies typically generate on the order 
of hundreds of millions of short reads (of the order of a few 
hundred nucleotides or shorter) on a single run. In order to 
analyze all of these reads in a reasonable amount of computational 
time, the performance of reference-guided alignment programs is 
paramount. The memory footprint of these algorithms must also 
be well managed to allow their deployment beyond institutions 
with extremely expensive computational infrastructure. These 
goals must be met without compromising the accuracy of resulting 
alignments. Most existing aligners utilize hashing algorithms or the 
Burrows- Wheeler transform [5,6] to search exact matches 
(algorithms may be modified to allow few mismatches) as their 
first step to achieve high performance and optimize memory 
usage. Theoretically, hashing method outperforms BWT method 
for DNA database searching [7]. The hash-based aligners, Eland 
(AJ Cox, lllumina, San Diego), MAQ, [8], mrFAST/mrsFast 
[9,10], SHRiMP [11,12], and ZOOM [13,14] hash reads and fit 
these hashes to the reference genome, while MOM [15], 
MOSAIK, PASS [16], ProbeMatch [17], SOAP [18], SRmapper 
[19], and STAMPY [20] hash the reference genome and store this 
for comparison with reads. Major Burrows-Wheeler transform 
(BWT) based aligners include BWA [21], Bowtie [22,23], 
segemehl [24] and SOAP2 [25]. In general, BWT-based aligners 
are sensitive but include a slow query operation (each FM-index 
query is slower than a hash query [26,27]). In regions with 
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genomic variation (e.g. those regions in which the investigator is 
usually most interested), maintaining good performance generally 
leads to lower sensitivity [19,28]. In addition, the Burrows- 
Wheeler transform method is less flexible than hash based 
methods. For example, it is more difficult for the Burrows- 
Wheeler transform to consider ambiguities by using IUPAC [29] 
ambiguity codes representing, for example, known SNPs. The 
main drawback of hash-based aligners is that they usually consume 
more memory than BWT-based aligners; however, as high- 
memory machines become cheaper, this becomes less of a 
problem. Currently, MOSAIK can be operated in a low-memory 
mode that keeps the memory footprint small (~8Gb for the 
human genome), ensuring that even for lower memory machines, 
MOSAIK can still be used with confidence. 

Here, we introduce a reference-guided aligner, MOSAIK, that 
is highly sensitive, stable and flexible, whose utility on a range of 
different sequencing technologies has been demonstrated in the 
context of the 1000 Genomes Project [30,31]. In addition to 
MOSAIK's ability to map data from all major sequencing 
technologies, it has been developed to address many of the issues 
currently facing genome researchers. These developments are 
outlined here. The primary goal of any mapping software is to 
minimize alignment artefacts and increase alignment sensitivity 
and accuracy. To achieve this, MOSAIK uses a Smith-Waterman 
algorithm and is able to align reads to a genome using IUPAC 
ambiguity codes, ensuring that alignments against known single- 
nucleotide polymorphisms (SNPs) are not penalized. Using this method, 
MOSAIK achieves positive predictive values (PPVs) of 99.5% for 
all alignments and 100.0% for high confidence alignments (those 
with a mapping qualities larger than 20) in experiments on 
simulated data. In addition to providing the genomic coordinates 
of the read mapping, it also important to provide a measure of the 
confidence in this coordinate. For this purpose, MOSAIK uses a 
neural-network based training scheme to provide well-calibrated 
mapping quality scores. In our experiments, the correlation 
coefficient between the quality scores assigned by MOSAIK and 
the actual scores is 0.97. To ensure that studies of any genome are 
supported, MOSAIK provides a training pipeline to ensure 
optimal mapping quality scores for the genome under investiga- 
tion. A major area of active investigation is the study of structural 
variation (SV). MOSAIK has been designed to aid and simplify 
the discovery of such variants. In particular, known-insertion 
sequences, for example, mobile element insertions (MEIs), can be 
included as part of the reference genome. This helps to minimize 
alignment artefacts, but MOSAIK also provides a host of valuable 
information to user on the paired-end reads that map to one of 
these sequences. When requested, MOSAIK also outputs all 
possible mapping locations for every read in a separate BAM file. 
This is essential for determining the mappability of the genome 
under study. The most recent versions of BWA, BOWTIE and 
MOSAIK are comparable in their run times, and STAMPY is 
approximately six times slower. Finally, MOSAIK is implemented 
in C++ as a modular suite of programs that is dual licensed under 
the GNU General Public License and MIT License. It is multi- 
threaded, open source, and incorporated into our command and 
pipeline launcher system GKNO (http://gkno.me). 

Results 

Alignments from all sequencing technologies 

All of the available sequencing technologies use different 
techniques for library preparation, paired-end read protocols and 
DNA sequencing, resulting in a range of read lengths, fragment 
lengths, base quality assignments, as well as different error profiles. 



Additionally, not all technologies report their sequencing reads in 
the conventional basespace (strings of the A, G, C and T 
nucleotides) format. Notably, AB SOLiD uses a di-base encoding 
scheme known as colorspace and single-molecule sequencing 
technologies use dark bases [32] for bases not registered by the 
instrument. These facts mean that all of the currently available 
aligners are tailored for use on data from one, or a small number of 
the available technologies. MOSAIK is the only aligner that can be 
used in a consistent manner across most of these technologies. 

In addition to the second-generation technologies, Illumina, 
Roche 454 and AB SOLiD, MOSAIK can also be deployed on 
third-generation technologies, in particular, Pacific Biosciences 
and Ion Torrent reads. MOSAIK uses the same algorithmic 
approach for all sequencing technologies, however, since the 
characteristics of each technology are different, the resultant 
alignment rates vary, as shown in Table 1 . These alignment rates 
were generated using Illumina paired-end (PE), single-end (SE) 
and Roche 454 SE reads generated using the MASON read 
simulator (http://www.seqan.de/projects/mason/) as well as 
Illumina and AB SOLiD reads from the Han Chinese in Beijing 
(CHB) population from the 1000 Genomes Project. For the third- 
generation technologies, we used E. coli reads provided by Ion 
Torrent (http://www.iontorrent.com/applications-pgm-accuracy/) 
and V. chokrae reads provided by Pacific Biosciences (ftp:// 
ftp.ncbi.nlm.nih.gov/sra/Submissions/SRA026/SRA026766/pro 
visional/SRX032454/SRR075103/). 

In general, sequencing reads containing fewer sequencing errors 
have higher alignment rates, e.g. Illumina reads, and longer or 
paired-end reads require more time to align. That paired-end 
reads take additional time is not unexpected. If one of the reads in 
a pair cannot be mapped unambiguously, additional searches are 
performed guided by the mapped mate in the pair. The additional 
processing time results in more accurate alignments as well as a 
lower fraction of unaligned reads. AB SOLiD reads are aligned in 
colorspace (converting to basespace prior to alignment loses all of 
the benefits of colorspace), but additional processing is required 
due to the required conversion of the alignments into basespace 
post-alignment. These experiments show that MOSAIK works 
well for existing sequencing technologies. 

Highly accurate alignments on simulated data 

To investigate the accuracy of reads aligned using MOSAIK, 
we simulated a total of 1 2 million Illumina paired-end reads from 
chromosome 20 of the Hgl9 human genome using the MASON 
read simulator. Reads of length 76 and 100 basepairs were 
simulated with a haplotype SNP rate of 0.1%. The reads were 
aligned against the entire human genome using BWA-0.5.9, 
BOWTIE-2.0-beta5, STAMPY- 1.0. 13, and MOSAIK-2. 1.78. 
The default parameter settings were used for all of the aligners. 
The positive predictive value of each aligner was then calculated as 
the number of correctly placed reads (the genomic coordinate of 
the mapped read agreed with the known location of the read from 
MASON) divided by the total number of mapped reads. Notice 
that an alignment is considered incorrect as the aligned position is 
out the 20 bp tolerant window and thus alignments with more 
than 20 bp unmapped bases may be considered as incorrect. We 
choose 20 bp as the tolerant window since on the dataset most of 
alignments contain fewer than 20 bp clipped bases (see supple- 
mental Figure SI). 

Figure 1 shows the positive predictive value (PPV, the number 
of correctly mapped reads divided by the total number of mapped 
reads) of the aligners as a function of mapping quality cutoffs 
(complete information is shown in Figure S2). At a mapping 
quality cutoff of twenty, for example, the PPV is calculated using 



PLOS ONE | www.plosone.org 



2 



March 2014 | Volume 9 | Issue 3 | e90581 



MOSAIK: An Accurate NGS Short-Read Mapper 



Table 1 . Summary of the alignment accuracies achieved by MOSAIK for reads generated from different sequencing technologies. 





Technologies 


Aligned (%) 


Speed (reads/second) 


Read lengths [min;max] 


Reference genome 


Dataset 


lllumina; PE 


99.98 


83.95 


100; 50 


Human hgl9 


MASON simulated 


lllumina; SE 


99.75 


153.98 


100; 76; 50 


Human hg19 


MASON simulated 


lllumina; PE/SE 


91.48 


147.42 


81; 76; 51; 45; 41 


Human hgl9 


CHB population in 1000G 


454; SE 


99.42 


8.018 


400.673 [266;529] 


Human hg19 


MASON simulated 


Ion Torrent 


77.02 


20.85 


223.99 [59;398] 


E. coli strain 536 


Ion Torrent released 


SOLiD 


55.64 


126.81 


50 


Human hg19 


CHB population in 1000G 


Pacific Biosciences* 


85.79 


0.69 


698.61 [48;6084] 


V. cholerae 4,033,464 bp. 


Pacific Biosciences released 



*The parameter set "-hs 10 -mmp 0.5 -act 15" was used as opposed to the default values "-hs 15 -mmp 0.15 -act 55". 
With the exception of the Pacific Biosciences data, all alignments were generated using MOSAIK's default parameters. 
doi:1 0.1 371 /joumal.pone.0090581 .t001 



only those reads with mapping quality values greater than or equal 
to twenty. It can be seen that the PPVs of BWA and MOSAIK are 
comparable and are significantly better than those achieved by 
BOWTIE and STAMPY. For mapping quality cutoffs smaller 
than five, BWA is more accurate (fewer incorrect alignments 
among total mapped alignments) than MOSAIK, however, 
MOSAIK is the most accurate as the mapping quality cutoff is 
increased. For a mapping quality cutoff of twenty (a common 
cutoff employed by downstream analysis tools that only wish to 
consider confidently aligned reads), the PPVs of MOSAIK, BWA, 
BOWTIE and STAMPY are 100.00%, 99.99%, 99.79% and 
99.63% respectively. These results are summarized in Table 2. 

Figure 2 shows receiver operating characteristic (ROC) curves 
for the same data. The total number of mapped reads (x axis) is 
plotted against the number of incorrectly mapped reads (y axis). 
Each point on the curve represents the number of alignments 
whose mapping qualities are greater than or equal to the indicated 
value. MOSAIK has a relatively smooth curve, ensuring that 
downstream tools that employ mapping quality cutoffs (i.e. 
ignoring all reads with mapping qualities less than the cutoff) do 
not incur extremely large changes in the number of reads while 



progressively increasing the cutoff. Conversely, the other aligners 
do not share this property. For example, consider the BWA 
alignments. By decreasing the mapping quality cutoff from 30 to 
29, the number of incorrecdy mapped reads increases by 308.56% 
while for MOSAIK, the increase is a much more modest 6.25%. 
Downstream analysis tools require a useful mapping quality scale, 
so that excluding lower quality reads improves the specificity of the 
analysis results. The dynamic range demonstrated by MOSAIK is 
therefore a very valuable result for these tools. 

Mapping quality calibration 

The Phred mapping quality score present in the standard 
SAM/BAM format represents the probability that the read was 
mapped incorrectly and is defined as: 



101og 10 P, 



(Equation 1) 



where Q, is the Phred score and P is the probability that the read 
was misaligned. For example, a read assigned a Phred mapping 
quality score of 30 has a 1 in 1000 chance of being misaligned. 



(A) 



0.999 



cl 0.998 



0.99r 
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Figure 1. The positive predictive value of aligners (the number of correctly mapped reads divided by the total number of mapped 
reads) as a function of mapping quality threshold. Datasets in (A) 100 bp and (B) 76 bp read lengths. PPV, TP, and FP stand for positive 
predictive value, true positive and false positive, respectively. 
doi:10.1371/journal.pone.0090581.g001 
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Table 2. The positive predictive values (the number of correctly mapped reads divided by the total number of mapped reads) in 
terms of mapping quality cutoffs. 





MQ cutoffs 


30 




20 




10 




0 




Read lengths 


100 


76 


100 


76 


100 


76 


100 


76 


BWA 


1 


1 


0.9998 


0.9997 


0.9996 


0.9995 


0.9971 


0.9959 


BOWTIE 


0.9998 


0.9992 


0.9982 


0.9976 


0.9980 


0.9972 


0.9823 


0.9819 


STAMPY 


0.9961 


0.9986 


0.9945 


0.9982 


0.9897 


0.9954 


0.9813 


0.9909 


MOSAIK 


1 


1 


1 


1 


0.9999 


0.9999 


0.9962 


0.9947 



doi:1 0.1 371 /journal.pone.0090581 .t002 



MOSAIK's mapping qualities are obtained using a neural 
network that approximates the error function when provided with 
features such as best and second best Smith- Waterman alignment 
scores, read entropy, number of potential mapping locations and 
hashes. For paired-end reads, the fragment length of mapped 
paired end reads is also used in the neural network to produce 
more precise mapping quality calculations. MOSAIK embeds the 
Fast Artificial Neural Network (FANN) library (http:/ /leenissen. 
dk/fann/wp/), which implements multilayer artificial neural 
networks in C, supporting both fully connected and sparsely 
connected networks, to calculate Phred score for each alignment. 

The default neural network provided with MOSAIK was 
generated by training on the human genome. The first step 
involves simulating reads and then aligning them to the human 
reference genome to obtain MOSAIK's behaviour such as best 
and second best Smith-Waterman scores, and numbers of 
obtained mappings and hashes. Then, the neural network was 
trained based on MOSAIK's behaviour. 

Figures 3(A) and 3(B) compare the actual (calculated using 
Equation (1)) and the assigned mapping quality scores. Both, 
BOWTIE and MOSAIK produce very accurate Phred score 
mapping qualities across the whole quality score spectrum. The 
Pearson correlation coefficients between the assigned and actual 
quality scores are shown in Table 3. MOSAIK has an average 
(across all read lengths investigated) correlation coefficient of 



0.9698, compared with 0.9061, 0.9207, and 0.8652 for BWA, 
BOWTIE, and STAMPY respectively. 

Retraining Mapping-Quality Neural Network for E. coli 
Alignment 

The genomes of different species differ in many respects 
including sequence content (base composition as well as relative 
frequency of repeat or low-complexity sequence) as well as the size 
of the genome. Most aligners, including MOSAIK, are general 
programs that can operate on any given reference genome, 
however, in general, the properties of the genome under 
investigation are ignored. MOSAIK provides a retrainable 
mapping-quality pipeline to generate applicable neural networks 
for different genomes or sequencing technologies. This means that 
the calibration of the mapping quality scores remains of a very 
high quality, regardless of reference genomes. 

To demonstrate the merit of the retrainable mapping-quality 
pipeline, we used 6 million simulated paired-end reads from the E. 
coli genome to train a neural network (see supplemental method 
(A): Retraining Mapping Quality Neural Network). An additional 
independent set of 6 million simulated E. coli paired-end reads 
were then generated and aligned to the E. coli genome using 
multiple aligners. The assigned and actual mapping quality scores 
are plotted for all aligners in Figure 3(C). There are two sets of 



Read length: 100bp 




Read length: 76bp 



2 3 4 
# of Alignments 




x 10 u 



2 3 4 
# of Alignments 



x 10 



Figure 2. The receiver operating characteristic (ROC) curves; Datasets in (A) 100 bp and (B) 76 bp read lengths. Each point represents 
the total numbers of alignments whose mapping qualities are greater than the indicated value. MOSAIK has a relatively smooth curve, ensuring that 
downstream tools that employ mapping quality cutoffs (i.e. ignoring all reads with mapping qualities less than the cutoff) do not incur extremely 
large changes in the number of reads while progressively increasing the cutoff. 
doi:1 0.1 371 /journal.pone.0090581 .g002 
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(C) 



100 



80 



E. coli reads 



[Optimal] 

BWA 
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MOSAIK 
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Assigned Mapping Quality 
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Figure 3. The correlations between the aligners' assigned and actual mapping qualities. Phred score scheme. (A) and (B) simulated 
datasets in 100 bp and 76 bp read lengths. (C) E. coli simulated dataset in which "MOSAIK" is MOSAIK's default mapping-quality network trained by 
human genome while "MOSAIK-retrained" is the retrained mapping-quality network by using £ coli simulation and £ coli genome. The detailed 
numbers of the Pearson's correlation coefficients are given in Table 3. 
doi:1 0.1 371 /journal.pone.0090581 .g003 



data for MOSAIK: the first (red crosses) is generated using the 
default neural network trained on the human genome, and the 
second (dark red diamond) uses the neural network retrained on 
the E. coli genome. It is clear that the mapping qualities generated 
by the retrained neural network for MOSAIK are the best 
calibrated, although the data using the human genome trained 
neural network is still of a high quality. Also of note, Figures 3(A) 
and 3(B) show that BOWTIE has quite well calibrated mapping 
qualities for mapping to the human genome, however, when 
applied to E. coli, the calibration is noticeably worse. 

MOSAIK accurately accounts for short INDELs 

MOSAIK uses a Smith-Waterman (SW) algorithm as the final 
polishing step to produce pairwise read alignments, which is the 
preferred choice for aligning gapped (short INDELs) sequences 
since it seeks all possible frames of alignment with all possible gaps. 
To assess the sensitivity of different aligners to short INDELs, we 
simulated Illumina paired-end reads containing 1-14 bp INDEL 
events that are generated by a genome simulator, MUTATRIX 
(https://github.com/ekg/mutatrix). For each INDEL length, we 
introduced an average of 100 events, with approximately 800 
spanning reads (see supplemental Figure S4). 

Figures 4(A) and 4(B) plot the sensitivity (number of correcdy 
mapped reads divided by the total number of simulated reads) as a 
function of the INDEL length. An alignment is considered correct 
when it is mapped to the correct position as well as contains the 
simulated variant. Alignments containing the correct variants can 
facilitate downstream variant detectors detecting variants depend- 

Table 3. Pearson's correlation coefficients of mapping 
qualities. 



Read lengths 


100 


76 


£ coli 


BWA 


0.8987 


0.8625 


0.8936 


BOWTIE 


0.9027 


0.9449 


0.6989 


STAMPY 


0.8317 


0.8818 


0.5262 


MOSAIK 


0.9609 


0.9497 


0.8881 


MOSAIK-retrained 






0.9749 



doi:1 0.1 371 /journal.pone.0090581 .t003 



ing on alignments and need no any realignment step which is 
timing consuming. MOSAIK is the most sensitive aligner 
considered here when considering deletions. When considering 
insertions, MOSAIK's sensitivity is comparable to, but slightly 
worse than those of STAMPY and BOWTIE. It is clear from 
Figures 4(A) and 4(B) that MOSAIK is the only mapper 
considered here that is highly sensitive to both insertion and 
deletion polymorphisms. We understand that some aligners tend 
to report partial alignments that may not contain variants but are 
mapped to right places. Those alignments still provide values for 
variant detections. We thus change the criteria of correct 
alignments used in Figures 4(A) and 4(B). In figures 4(C) and 
4(D), an alignment is considered a correct mapping when it is 
entirely or partially mapped to the correct positions. The four 
aligners achieve 96% sensitivity based on the criteria. 

Effect of mapping errors on SNP studies 

Aligners provide information on where reads map in the human 
genome along with information on the confidence of the mapping, 
however, they do not themselves weigh evidence for genetic 
variants in the genome being studied. Dedicated variant callers use 
the information provided by mapper in statistical models to 
determine if there is enough evidence to report a difference with 
respect to the reference genome. To determine the effect of the 
mapping on single nucleotide polymorphism (SNP) discovery, we 
simulated 1,486 SNPs on the human genome chromosome 20 
using MUTATRIX. We then used MASON to generate 12 
million reads (with read lengths of 76 and 100 basepairs) from this 
mutated chromosome generated by MUTATRIX. The same four 
aligners were then used to align these reads back to the entire 
human reference genome and the variant callers FREEBAYES 
[33] and SAMTOOLS [34] were used to call SNPs. Figure 5 
shows the variant callers sensitivity to SNPs as a function of the 
false discovery rate (FDR) (the complete information is shown in 
Figure S3). The points on the curves are generated by only 
considering SNP calls with variant quality scores (provided by the 
variant caller) greater than a specific cutoff. Moving from lower- 
left to upper-right, SNP calls with lower quality scores are 
cumulatively being included. Both FREEBAYES and SAM- 
TOOLS produce lower sensitivity calls on the BOWTIE 
alignments and have a lower FDR on BWA and MOSAIK 
alignments. It is clear from both Figures 5(A) and 5(B) that the 
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(A) Read length: 100bp (B) Read length: 76bp 




-15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15 

INDEL length INDEL length 




Figure 4. The sensitivities of simulated reads spanning INDELs, which is defined as the number of correct mapped reads divided by 
the number of simulated reads for each INDEL length. In (A) and (B), the alignments are considered correct as they cross INDELs, while in (C) 
and (D), the alignments are considered correct as they are entirely or partially mapped to the correct positions. TP and FN are "true positive" and 
"false negative" respectively. 
doi:1 0.1 371 /journal.pone.0090581 .g004 



most sensitive SNP calls are produced when using the MOSAIK 
alignments, although the BWA alignments are also of a high 
quality. It is also worth noting that the SNP calls produced by 
FREEBAYES are more sensitive than those produced by 
SAMTOOLS regardless of the mapper used. 

Support for mobile element insertion 

Detecting structural variations using NGS data is a more 
complex task than detection of short variants and often requires or 
would benefit from information over and above that ordinarily 
required for small variant detection. An increasing number of SV 
detection algorithms are being developed and, in order to increase 
the effectiveness of these algorithms, MOSAIK has been 
developed to provide as much relevant and useful information as 
possible. 

There are many genetic sequences that can be considered 
distinct from the standard set of chromosomes in the genome 
under investigation. These can include repetitive sequences such 
as mobile elements [35], viruses (e.g. human endogenous 



retroviruses [36]), known novel insertions [37,38] or bacterial 
contaminants [39] amongst others. MOSAIK provides support for 
an additional reference genome file containing any genetic 
sequences provided by the investigator. The advantages of this 
are two-fold: a) reads originating from contaminants will map to 
the additional sequences, rather than a lower quality mapping to 
the best location in the standard reference genome. These 
sequences essentially act as a sink to catch all the reads that do 
not originate from the standard reference, reducing the number of 
mismapped reads that variant detectors have to contend with, b) 
Reads mapping to repetitive elements (e.g. ALU or LINE 
elements) are identified as mapping to the additional reference 
sequence. MOSAIK reports the coordinates of the best mapping 
in reference genome coordinates, but also includes an additional 
tag in the BAM file (appearing as ZA in the BAM file), indicating 
that the read maps to one of the additional reference sequences. 
Our MEI detector, TANGRAM (https://github.com/jiantao/ 
Tangram) looks for read pairs with one mate uniquely aligned to 
the genome and the other mate falls within a mobile element 
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(A) FREEBAYES (B) SAMTOOLS 




FDR = FP / (FP + TP) FDR = FP/(FP + TP) 



Figure 5. The receiver operating characteristic (ROC) curves of SNPs called by FREEBAYES and SAMTOOLS. The points on the curves 
are sorted by called qualities and the points closer to the upper-right corner have higher called qualities. The true positive (TP), false positive (FP), and 
false negative (FN) are calculated by intersecting SNPs called on each aligner's alignments and gold SNPs called on the simulated alignments. 
doi:1 0.1 371 /journal.pone.0090581 .g005 



reference sequence. Only relying on this information provided by 
MOSAIK, the sensitivity of MEI detection can achieve 84%. 

Applications 

SNP and INDEL Analyses in the 1000 Genomes 
Project. The 1000 Genomes Project is in the process of using 
second-generation sequencing instruments to study human genetic 
variation at the population level. The Phase I [31] release, based 
on a population of 1,092 sample individuals in 14 populations 
includes approximately 38 million SNPs, 1.4 million bi-allelic 
INDELs and 14,000 large deletions. These calls were generated 
from approximately 966 billion reads and 64 trillion base pairs of 
human DNA and were sequenced using Illumina, AB SOLiD and 
Roche 454 for both low-coverage whole-genome and exome 
targeted sequencing data. A collaborative effort between Boston 
College and the National Center for Biotechnology Information 
(NCBI) used MOSAIK to align all of the reads from all of these 
machines, and served as the official primary alignment set for the 
exome sequencing data [40] and an alternative alignment set for 
the low-coverage. 

Based on the MOSAIK alignments, SNP, MNP (multi- 
nucleotide polymorphism) and INDEL calls were generated using 
the FREEBAYES Bayesian variant calling software. 33,324,407 
SNPs were detected in the autosomes of the 1,092 samples, of 
which, only 23.8% were previously known sites (contained in 
dbSNP). The transition/ transversion (ts/ tv) ratio for these sites was 
2.12 (2.1 for novel sites and 2.17 for known sites). The Illumina 
exome data yielded 344,781 SNPs with a ts/tv ratio of 3.18 (3.09 
for the novel sites and 3.52 for the known sites) and 22.1% of the 
exome sites were previously known. The SOLiD exome data 
yielded 1 76,637 SNPs with a ts/tv ratio of 3.34 (3.22 for novel sites 
and 3.58 for known sites). The ts/tv ratios are in accordance with 
expectations for both the low-coverage and the exome SNPs. 

Other SNP Studies. In addition to the 1000 Genomes 
Project, MOSAIK is widely used for other human clinical genome 
studies, such as human cancer studies [41-46]. MOSAIK is also 
used for other species genome studies including model species 
[47,48], HIV [49-52], parasites [53-55], plants [56-58], and 
other animals [59,60]. 



Human Mobile Element Insertion Discovery 

In addition to short variants, the 1000 Genomes Project aims to 
characterize larger structural variations present in the human 
population. By augmenting the reference genome with known 
mobile element insertions (MEI), the MOSAIK alignments were 
able to provide a host of information about their distribution in the 
human population. As part of the pilot phase of the project, 7,380 
MEI polymorphisms were detected using the whole-genome 
sequencing data [61]. This sample set included 60 samples of 
European origin (CEU), 59 African (YRI) and 60 Asian samples 
from Japan and China (CHB/JPT). The FDRs for Alu, LI, and 
SVA insertions were 2%, 17%, and 27% respectively. 

Discussion 

MOSAIK is a highly sensitive, stable and flexible reference- 
guided read mapper which supports most existing sequencing 
technologies. While MOSAIK is extremely accurate (positive 
predictive values achieve 99.5% for all alignments and 100.0% for 
alignments whose mapping qualities are larger than 20 on 
simulated data), not all reads are aligned with equal confidence. 
The mapping qualities that MOSAIK provides are generated 
using a retrainable neural network and are a very good 
representation of the probability of the alignment being incorrect. 
In fact, the correlation coefficient between MOSAIK assigned and 
the actual mapping qualities is 0.97. The retraining pipeline 
ensures optimized mapping quality score schemes for any genome 
being studied. For example, when considering aligning against the 
E. coli genome, the correlation coefficient increases from 0.89 to 
0.97 when using the human and the E. coli neural nets respectively. 
By using the Smith-Waterman algorithm, MOSAIK is very 
effective at mapping reads containing short INDELs and the 
experiments demonstrate that the sensitivity of INDEL mappings 
is greater than 90%. Additionally, MOSAIK provides explicit 
support for SV detections. 

Most SV detectors make extensive use of information from 
paired end reads [62-64]. If the two mates in a pair map to gready 
separated locations (often the case when the read pair spans or falls 
within a structural variant), multiple searches through the BAM 
files are required to assemble all of the information about both 
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mates. This can be a lengthy task, severely impacting the 
performance of SV detectors. The ZA tag provides a host of 
information about the reads mate, including the location, mapping 
quality, number of mappings for the mate, which ensures that 
these searches are not required, created vast increases in the 
efficiency of the SV detectors using this information. 

The other utility for SV detections is reporting all possible 
mappings. Many genomes contain regions that are considered 
unmappable, usually due to the presence of low complexity DNA. 
Depending on the algorithms employed, NGS reads can still map 
to these regions; however, it is often prudent to omit these reads 
from variant detection. Instead of discarding reads mapping to 
multiple locations or picking the best quality alignment, MOSAIK 
records all locations to which a read maps (given the constraints 
imposed by the selected parameters) and records them in a 
separate BAM file. Since the number reads mapping to multiple 
locations as well as the number of entries for each multiply aligned 
read can be extremely large, the resulting BAM file has the 
potential to be excessively large. By default, MOSAIK omits much 
of the read specific information (e.g. read name, sequence and 
error information), allowing for effective compression of the file 
after positional sorting, resulting in very small BAM files. The 
information contained in these BAM files allows easy identification 
of genomic regions where many individual reads are aligning. 
These regions are those that can be considered unmappable, since 
reads hitting these regions are also able to align to other genomic 
regions. Thus they provide a guide to the mappable genome which 
can greatly aid in variant discovery. 

The default parameters used by MOSAIK were optimized 
using simulated Illumina datasets from the human genome. They 
were generated to provide a balance between mismatches and 
gaps in the alignments, leading to balanced calling of SNPs and 
INDELs by variant callers. For the experimenter only interested in 
a specific variant type, it is possible to modify the parameters to 
provide alignments more sensitive for the variant type of interest. 
For example, if INDEL discovery is paramount, reducing the 
Smith-Waterman penalty for the creation and extension of gaps in 
alignments will lead to a greater likelihood that INDELs will be 
discovered. 

MOSAIKs memory footprint depends on the size of the 
reference hash-table which, in turn, depends on the hash (k-mer) 
size as well as the length of the reference sequence. For the human 
genome using the default value of k= 15, MOSAIK requires 
approximately 20Gb of memory. For machines with less available 
RAM, MOSAIK can be run in a low-memory mode that performs 
alignments chromosome by chromosome. This reduces the 
required memory to 7Gb, which makes MOSAIK accessible to 
most machines. 

Improvements in the computational performance can be 
achieved at the expense of decreased sensitivity, but ongoing 
development (including replacing the traditional Smith-Waterman 
algorithm with a single-instruction-multiple-data (SIMD) Smith- 
Waterman algorithm [65,66]) provides significant performance 
improvements. Initial testing of the SIMD Smith-Waterman 
algorithm demonstrate a twofold speed up [65] . Further improve- 
ment is achieved by reducing the number of applications of the 
Smith- Waterman algorithm for each read. Reads that originate in 
highly repetitive sequence can produce tens of thousands of 
candidate loci (see supplemental Figure S6) in the genome and the 
Smith- Waterman algorithm is applied to each one of these regions. 
This is extremely computationally intensive with very little benefit 
to the alignment sensitivity. As a result, if there are greater than a 
preset number (the default is 200) of potential mapping loci, 
MOSAIK only invokes the Smith- Waterman algorithm on the top 



200 loci. MOSAIK then reports the most confident alignment 
from all of the regions in which the Smith- Waterman algorithm 
was applied. Supplemental Table SI demonstrates that MOSAIK 
2.2.3 (with these modifications) is of the order of five times faster 
than version 2.1.78 (without the modifications). Importantly, these 
modifications do not adversely impact the sensitivity of MOSAIK. 

Methods 

Overview 

MOSAIK is a hash-based aligner and it hashes reference 
sequences as its first step. MOSAIK splits the reference sequences 
into overlapping contiguous £-mers (hashes) and stores the 
positions of each hash in a hash table data structure that 
guarantees 0(1) lookups. Then, MOSAIK hashes each read in 
the same hash size and looks hashes up in the hash table to obtain 
the genomic positions of the hashes of a read. Next, nearby hash 
positions are consolidated as a hash region (hashes of a read may 
be clustered as several hash regions) where a Smith-Waterman 
algorithm is applied to align the read to the local region of a 
genome reference sequence as a final "polishing" step. For paired- 
end reads, each end-mate of a read is mapped separately. For 
some cases, that may be one end-mate aligned well and the other 
one failing to be aligned. The well-aligned mate can be used to try 
and rescue the unaligned mate using knowledge of the approx- 
imate fragment length used in the paired-end read generation. 

Processing Reference Sequences 

MOSAIK can handle a nearly unlimited number of reference 
sequences, however, the maximum aggregated reference length is 
four billion bases. Alignments to the human transcriptome using 
more than 95,000 individual reference sequences are easily 
handled. The available hash sizes are 4—32. 

MOSAIK supports the full set of IUPAC ambiguous nucleotide 
characters. This allows users to use reference sequences that have 
been masked by confirmed dbSNP (http://www.ncbi.nlm.nih. 
gov/projects/SNP/) calls. The ambiguity codes minimize the 
alignment bias that might be caused when aligning to reference 
sequences containing SNPs. For considering IUPAC, MOSAIK 
substitutes ambiguous codes with all of the alternative bases 
represented by the ambiguity code and stores the resulting hashes 
in the hash table. In order to avoid increasing the size of the jump 
database dramatically, the ambiguity codes N and X are not 
considered when hashing the reference sequences. 

Clustering Hashes 

MOSAIK supports various read formats (SRF, FASTA, 
FASTQ, Bustard, and Gerald). In each case, the reads are split 
into a set of overlapping hashes and the genomic positions of each 
hash are queried from the stored reference hash table. A modified 
AVL tree [67] is employed to handle and cluster nearby hash 
positions to form a hash region. The clustering algorithm considers 
sequencing errors, SNPs and single base INDELs. For example, 
consider a 35 bases read split into hashes of 15 bases. The first 
hash consists of the first 1 5 bases in the read. The second hash 
consists of bases 2-16 in the read and so on. The read consists of 
22 individual hashes, each of which is associated with positions 
within the reference genome. If the read can be aligned perfectiy 
to the somewhere in the reference genome (i.e. there are no 
sequencing errors or variations), each of the 22 hashes will have a 
reference genome position offset by a single base (i.e. if the first 
hash in the read is associated with the reference position x, the 
second hash with the reference position x+1 etc.). The AVL tree 
will consolidate those hits into a single alignment candidate region 
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(see Supplemental Figure S5(A)). The presence of a single 
sequencing error will ensure that 15 of the hashes (each hash 
overlapping the error), will not be associated with the correct 
genomic coordinate. Since the clustering algorithm considers 
sequencing errors, however, an alignment candidate region is still 
present in the AVL tree (see Supplemental Figure S5(B)). 

Applying Smith-Waterman Algorithm 

After identifying alignment candidate regions, MOSAIK 
employs a Smith-Waterman algorithm to align reads to the 
alignment candidate regions. The Smith-Waterman algorithm, 
which was invented over 30 years ago, is still regarded as the most 
accurate pairwise alignment algorithm and the preferred choice 
for aligning gapped sequences since it seeks all possible frames of 
alignment with all possible gaps. Specifically, the alignments are 
performed using the Smith-Waterman-Gotoh alignment algorithm 
[68,69]. 

The time complexity of the Smith- Waterman algorithm is 0(n 2 ), 
which may render the mapper useless due to poor performance. 
To address this, a banded Smith-Waterman algorithm [70] has 
been implemented to improve the performance. According to our 
experiments, the runtimes for aligning Illumina and Roche 454 
data are reduced by approximately 3x and 8x respectively. The 
further development of using SIMD SW promises significant 
performance improvements. 

Rescuing Paired-End Mates 

Each mate in a paired-end read is initially aligned individually. 
There are various factors that lead to some reads failing to be 
aligned to the reference. In the case of paired-end reads, the 
aligned mate can be used to try and rescue the unaligned mate 
using knowledge of the approximate fragment length used in the 
paired-end read generation. A local alignment search algorithm 
has been implemented which performs a Smith-Waterman 
algorithm in the region proximal to the aligned mate. If the read 
exhibits the expected strand, orientation, and fragment length, the 
read is considered rescued. Even if both mates in the pair are 
successfully aligned, the local alignment search may still be 
triggered, if the alignments are inconsistent with the expected 
fragment length. 

The number of mates rescued by the local alignment search 
depends largely on the read lengths considered. With increasing 
read length, the aligner is less likely to miss a potential alignment 
and therefore fewer alignments are rescued. 

Handling AB SOLiD reads 

AB SOLiD reads are represented in the colorspace rather than 
in the more conventional basespace. Most downstream applica- 
tions do not support colorspace and thus alignments require 
conversion to basespace for maximum utility. MOSAIK is 
equipped to align colorspace reads against a colorspace reference 
and then convert the resulting alignments into basespace. The di- 
base quality conversion algorithm uses the minimum of the two 
qualities that overlap a nucleotide in basespace. This approach 
allows users to specify parameters, such as the maximum number 
of mismatches. Additionally, it enables users to merge aligned 
SOLiD datasets with datasets from other sequencing technologies. 

Known-Sequence Insertion Detections 

MOSAIK is aware of user-specified insertion sequences, e.g. 
mobile element insertions. When the insertion sequences are 
provided, the reference hashes are prioritized such that alignment 
to the given insertion sequences are attempted prior to alignment 



to the genome reference. An additional tag in the BAM file (the 
ZA tag) then indicates any alignments of a read hitting the given 
insertion sequences. Since MEIs are repetitive elements, a read 
from an MEI can be mapped to several locations within the 
genome (potentially hundreds of locations). The ZA tag then 
populated with valuable information about the reads mate, 
including location, mapping quality and number of mapping 
locations for the mate. This information ensures that multiple 
BAM search operations (which can be lengthy for large BAM files) 
can be avoided. The downstream MEI detector can detect MEI by 
using ZA tag easily. 

Supporting Information 

Figure SI The distributions of alignments' softclips. 

(TIF) 

Figure S2 The complete information of Figure 1. The 

positive predictive value of aligners (the number of correcdy 
mapped reads divided by the total number of mapped reads) as a 
function of mapping quality threshold. Datasets in (A) 100 bp and 
(B) 76 bp read lengths. PPV, TP, and FP stand for positive 
predictive value, true positive, and false positive, respectively. 
(TIF) 

Figure S3 The complete information of Figure 5. The 

receiver operating characteristic (ROC) curves of SNPs called by 
FREEBAYES and SAMTOOLS. The points on the curves are 
sorted by called qualities and the points closer to the upper-right 
corner have higher called qualities. The true positive (TP), false 
positive (FP), and false negative (FN) are calculated by intersecting 
SNPs called on each aligner's alignments and gold SNPs called on 
the simulated alignments. 
(TIF) 

Figure S4 The short INDELs that are inserted for 
investigating the aligners' abilities for them, and the 
read coverage for each length INDEL. 

(TIF) 

Figure S5 MOSAIK hash clustering. (A) The read uniquely 
aligns perfectiy to the references, all hashes will succeed in finding 
the adjacent reference locations and the AVL tree will consolidate 
those hashes into one alignment candidate region. (B) However, if 
only one hash succeeds in finding the proper reference location 
because of sequencing errors, an alignment candidate region is still 
present in the AVL tree. 
(TIF) 

Figure S6 The distribution of candidate loci in the 
genome of reads. MOSAIK applies a Smith-Waterman 
algorithm to each candidate locus of a read to generate an 
alignment. Therefore, the number of candidate loci is equal to the 
number of executed the Smith- Waterman algorithm. The mhp of 
MOSAIK is the maximum number of investigated hash positions 
per 15-mer. 
(TIF) 

Method SI The methods of (A) Retraining Mapping 
Quality Neural Network and (B) Detecting Specified 
Insertion Sequences. 

(PDF) 

Table SI The runtime of each mapper for aligning six 
million 100 bp reads. The version without '*' are the exact 
version of each mapper for which we report performance 
comparisons. For up to date information, we also report speed 
for the current version (indicated by '*') of each software. 
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STAMPY is a single-threaded program and thus the runtime of 

using 4 cpus is not available. 

(PDF) 
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